Variations in individuals’ metabolic profiles are the result of their genetic makeup, environmental, and lifestyle factors. Hence, we have developed the MetaboVariation package to facilitate the flagging of individuals with intra-individual variation in their metabolite levels using repeated measures data.
Expanding on our previous work with the univariate MetaboVariation approach [1], we now employ a multivariate Bayesian generalised linear model (BGLM) that considers dependencies among metabolites. This multivariate approach allows us to flag individuals by considering all metabolites and their inherent dependencies, unlike previous research that focused on a single metabolites independently. Consequently, individuals whose observed metabolite levels deviate from their individual posterior predictive interval at a specified time point are flagged. The MetaboVariation package is built on the MCMCglmm package [2], which fits the multivariate BGLM. The approach is flexible, allowing for the consideration of multiple posterior predictive interval widths, and a detailed examination of metabolite variations at the individual level.
This document gives an overview of MetaboVariation
functionalities. See help
(package = MetaboVariation) for further details and
references available via citation("MetaboVariation").
To use the MetaboVariation package, the user should have the R software environment installed on their machine. The MetaboVariation package has the following dependencies which will be installed along with the package if not already installed on the machine:
To install and load the MetaboVariation package:
The object metabol.data is a simulated dataset that
contains metabolite levels of 5 metabolites for 150 individuals measured
at four time points. The covariates sex, age and BMI are also available
for these individuals.
To load the data and examine its column names type:
data(metabol.data)
colnames(metabol.data)
#> [1] "Individual_id" "SexM.1F.2" "Age" "BMI"
#> [5] "metA_1" "metB_1" "metC_1" "metD_1"
#> [9] "metE_1" "metA_2" "metB_2" "metC_2"
#> [13] "metD_2" "metE_2" "metA_3" "metB_3"
#> [17] "metC_3" "metD_3" "metE_3" "metA_4"
#> [21] "metB_4" "metC_4" "metD_4" "metE_4"The subject IDs of individuals are stored in the column named
Individual_id, while information on sex, age, and BMI are
stored in SexM.1F.2, Age, and
BMI, respectively. The column name metX_Y
represents the level of metabolite X at time point
Y.
To extract the names of metabolites that are present in the data, the
function get.metabolites reads the names of columns that
contain metabolite values and provides a list of unique metabolites. The
user should not pass any unnecessary column names (e.g., any column
names that relate to covariate or individual ID information) to the
function. The function will return a vector containing the names of
unique metabolites. To do this type:
The user can plot the distributions of metabolites for each time
point using the function metabolite.plot. The user can pass
either a single metabolite or multiple metabolites to this function. For
example, to view the distribution of the metaboliteA metabolite
type:
Specifications of the priors used in the BGLM requires consideration. The MetaboVariation package allows you to specify your own prior parameters based on specific settings, or alternatively, default settings are available.
You can specify your own prior parameters using the following structure, or let the package calculate them automatically based on fitting an independent BGLM to the metabolite data.
# Set the prior mean vector with zeros, one for each metabolite
prior_mean <- rep(0, length(metabolites))
# Set the prior variance vector with ones, one for each metabolite
prior_V <- rep(1, length(metabolites))
# Set the prior residual covariance matrix as an identity matrix
prior_R <- diag(1,length(metabolites))
# Set the prior random effects covariance matrix as an identity matrix
prior_G <- diag(1,length(metabolites))
# Combine all the priors into a list
prior <- list(
B = list(mu = prior_mean, V = diag(prior_V)),
R = list(V = prior_R, nu = length(metabolites) * 1.5),
G = list(G1 = list(V = prior_G, nu = length(metabolites) * 1.5))
)B: This is the prior for the regression coefficients, assumed to follow a multivariate normal distribution with prior mean of zero and prior covariance matrix set to a diagonal matrix where the diagonal elements are the variances of the regression coefficients.
R and G: These are assumed to follow inverse Wishart prior distributions. The scale matrix V for both R and G is set to an identity matrix. The degrees of freedom (nu) are set to 150% of the number of metabolites (M) to balance between overfitting and capturing metabolite dependencies effectively.
If nu is too close to the number of metabolites, the resulting prior distribution would be very wide, indicating high uncertainty and potentially underfitting the data. Conversely, if nu is too large, the distribution would be tightly concentrated around the scale matrix, potentially leading to overfitting by not allowing enough flexibility to capture the true variability in the data.
Under default settings, the package calculates the priors for B based on fitting an independent BGLM to the metabolite data. As for R and G, the scale matrix is a dense matrix with diagonal terms as the variance of metabolites calculated by fitting an independent BGLM to the metabolite data and non-zero off-diagonal terms. The off-diagonal values are set to 0.1. The signs of these off-diagonal terms match the signs from the correlation matrix of the metabolite data.
The main function, MetaboVariation, performs the
modeling. The function requires the data, the names of covariate columns
(if any), the individual ID column and list of metabolites as arguments.
To find the individuals with intra-individual variation, the BGLM
provides the posterior predictive distribution for every individual at
each time point, for each metabolite. These are used to identify
individuals with observed metabolite levels outside the
interval.width% highest posterior density (HPD) interval at
one time point; such individuals with notable variation are flagged. The
width of the HPD interval is set by the interval.width
argument where the default is a vector containing three values
c(0.95,0.975,0.99). The function returns summarised results
that contains the upper and lower limit of the HPD intervals and whether
the individual is flagged or not under each HPD interval width.
By setting type="dependent",
MetaboVariation models the data using a multivariate model
that takes dependencies between metabolites into account.
To fit the model, type:
model = MetaboVariation(data = metabol.data, individual_ids = "Individual_id",
metabolite = metabolites,covariates = c("SexM.1F.2","Age","BMI"),
interval.width =c(0.95,0.975,0.99),type="dependent")The function returns a list that contains the following values:
See ?MetaboVariation for further details about the
function.
The following code displays the first five observations flagged within the 95% HPD intervals. The row names indicate the individual, the time point, and the metabolite for which they have been flagged. The output includes three HPD interval widths (0.95, 0.975, 0.99), providing the lower and upper bounds for each of these HPD intervals. Additionally, there is a flag column that indicates whether the individual is flagged within each HPD interval.
head(model$result[model$result[,"flag0.95"]==1,])
#> mean lwr0.95 upr0.95 lwr0.975 upr0.975 lwr0.99
#> 11 1 metB 46.71275 43.71696 49.31910 43.69325 49.92331 42.82390
#> 109 1 metC 95.61841 93.16444 98.53489 92.39658 98.53489 92.10585
#> 110 1 metE 300.23288 297.58099 303.30789 297.16379 303.77024 296.80246
#> 122 1 metA 369.93380 367.38293 372.97372 366.77254 373.13867 366.23741
#> 126 1 metA 368.50945 365.82566 371.40447 365.33812 371.54775 365.10780
#> 127 1 metA 357.92564 355.19356 361.22895 354.79492 361.58909 353.99554
#> upr0.99 original flag0.95 flag0.975 flag0.99
#> 11 1 metB 49.97699 50.63484 1 1 1
#> 109 1 metC 99.36083 93.13228 1 0 0
#> 110 1 metE 304.23929 296.13965 1 1 1
#> 122 1 metA 373.34641 373.08306 1 0 0
#> 126 1 metA 371.96070 371.55486 1 1 0
#> 127 1 metA 361.79758 354.86673 1 0 0Another way of fitting the model is independently. By setting
type="independent", MetaboVariation treats
each metabolite independently.
To fit the model, type:
Now that we have two different models—one that considers dependencies
between metabolites and one that treats metabolites independently—it is
advisable to perform a posterior predictive check to determine which
model better fits the data. The function predictive_check
performs a posterior predictive check assess the fit of the model. The
function generates 100 replicate datasets by default from the posterior
predictive distributions. The function then provides a box plot that
shows the mean absolute difference (MAD) between correlation matrices of
the original dataset and those obtained from the replicated datasets
over each time point. The function additionally returns the actual
values of MAD for each replicate data set at each timepoint. The
following code performs the posterior predictive check for both
multivariate model and independent model.
multivariate_check = predictive_check(data = metabol.data, model = model,
timepoints = c(1,2,3,4), metabolites = metabolites, replication = 100)
independent_check = predictive_check(data = metabol.data, model = independent_model,
timepoints = c(1,2,3,4), metabolites = metabolites, replication = 100)The following code shows the boxplot of posterior predictive check of multivariate model.
The following shows the first six rows of MAD values.
#> Timepoint MAD
#> 1 1 0.09471758
#> 2 1 0.07928552
#> 3 1 0.09797898
#> 4 1 0.08355929
#> 5 1 0.07297066
#> 6 1 0.09531774
The following code compares the model fit between multivariate and independent models by using the MAD values from each model’s posterior predictive check.
# Setting model type for each MAD dataframe
multivariate_check$mad$Model = "Multivariate"
independent_check$mad$Model = "Independent"
plotdf = rbind(multivariate_check$mad,independent_check$mad)
ggplot2::ggplot(plotdf,ggplot2::aes(x=Timepoint,y=MAD,color = Model))+
ggplot2::geom_boxplot() + ggplot2::labs(colour ="Model",x="Timepoint",y="MAD")
+ ggplot2::theme_classic()The multivariate model shows lower MAD at each time point, suggesting a better fit compared to the independent model. Therefore, we will proceed to examine the results of the multivariate model.
After fitting the BGLM, the user can use the following plotting functions to visualise the results.
The function radar.plot visualises the metabolic profile
of an individual across all time points and all metabolites. This
function requires a MetaboVariation object, a specific
interval width and the individual ID to
generate the radar plot.
Here’s how you can use the radar.plot function:
The radar plots display intra-individual variations in metabolite levels across four time points, highlighting significant variations. Metabolites marked with an asterisk (*) are flagged for intra-individudal variation at each respective time point.
The metabolite.heatmap function visualises the flagged
individuals across all metabolites. This helps in identifying patterns
or variations in certain metabolites. This function needs a
MetaboVariation object, along with
interval width and a threshold argument to
generate the plot. The argument threshold indicates the
minimum number of metabolites an individual must be flagged in to be
shown in the heatmap.
The plot above shows the individuals that have been flagged in at least two metabolites. The darker blue shows the metabolites where the individual is flagged in two time points out of four.
The metpair.heatmap function visualises the number of
flagged individuals in pairs of metabolites (in the corresponding row
and column) that contributeto the flagging of the individuals within a
single time point.
This function needs a MetaboVariation object, along
with a specific interval width to generate the plot. The
interval refers to the width of the highest posterior density (HPD) that
you want to visualise.
To plot the heatmap, type:
The heat map illustrates the number of individuals flagged for intra-individual variation across pairs of metabolites within a time point. Darker shades of red indicate higher numbers of flagged individuals. Notably, 15 individuals are flagged for the pair metE and metC while 5 individuals are flagged for the pairs metA and metC, and metA and metE. This visualisation underscores the need for a comprehensive multivariate analysis, as multiple metabolite pairs contribute to identifying variations.
The circos.plot plot shows the posterior predictive HPD
interval for all time points for all individuals for a single
metabolite. The outer circle shows the individuals’ labels while each
inner circle represents a time point. The length of a bar represents the
width of the posterior predictive interval and a black bar indicates
that a particular individual has been flagged.
The circos plot visualises the 97.5% posterior predictive HPD intervals for all individuals across four time points for a single metabolite “metA”. From the plot, 14 individuals are flagged: 2 at time point 1 (blue), 4 at time point 2 (cyan), 6 at time point 3 (green), and 2 at time point 4 (black).
[1] Gupta, S., Gormley, I. C., & Brennan, L. (2023). MetaboVariation: Exploring Individual Variation in Metabolite Levels. Metabolites, 13(2), 164.
[2] Hadfield, J. D. (2010). MCMC Methods for Multi-Response Generalized Linear Mixed Models: The MCMCglmm R Package. Journal of Statistical Software, 33(2), 1–22.
[3] Gelman, A., & Rubin, D. B. (1992). Inference from Iterative Simulation Using Multiple Sequences. Statistical Science, 7(4), 457–472.